<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<title>Welcome | Data Science at the Command Line, 2e</title>
<meta name="author" content="Jeroen Janssens">
<meta name="description" content="This thoroughly revised guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small yet powerful command-line tools to quickly obtain, scrub, explore, and model your data. To get you started, author Jeroen Janssens provides a Docker image packed with over 100 Unix power tools—useful whether you work with Windows, macOS, or Linux.">
<meta name="generator" content="bookdown 0.24 with bs4_book()">
<meta property="og:title" content="Welcome | Data Science at the Command Line, 2e">
<meta property="og:type" content="book">
<meta property="og:url" content="https://datascienceatthecommandline.com">
<meta property="og:image" content="https://datascienceatthecommandline.com/og.png">
<meta property="og:description" content="This thoroughly revised guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small yet powerful command-line tools to quickly obtain, scrub, explore, and model your data. To get you started, author Jeroen Janssens provides a Docker image packed with over 100 Unix power tools—useful whether you work with Windows, macOS, or Linux.">
<meta name="twitter:card" content="summary_large_image">
<meta name="twitter:title" content="Welcome | Data Science at the Command Line, 2e">
<meta name="twitter:description" content="This thoroughly revised guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small yet powerful command-line tools to quickly obtain, scrub, explore, and model your data. To get you started, author Jeroen Janssens provides a Docker image packed with over 100 Unix power tools—useful whether you work with Windows, macOS, or Linux.">
<meta name="twitter:image" content="https://datascienceatthecommandline.com/twitter.png">
<!-- JS --><script src="https://cdnjs.cloudflare.com/ajax/libs/clipboard.js/2.0.6/clipboard.min.js" integrity="sha256-inc5kl9MA1hkeYUt+EC3BhlIgyp/2jDIyBLS6k3UxPI=" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/fuse.js/6.4.6/fuse.js" integrity="sha512-zv6Ywkjyktsohkbp9bb45V6tEMoWhzFzXis+LrMehmJZZSys19Yxf1dopHx7WzIKxr5tK2dVcYmaCk2uqdjF4A==" crossorigin="anonymous"></script><script src="https://kit.fontawesome.com/6ecbd6c532.js" crossorigin="anonymous"></script><script src="libs/header-attrs-2.9/header-attrs.js"></script><script src="libs/jquery-3.6.0/jquery-3.6.0.min.js"></script><meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<link href="libs/bootstrap-4.6.0/bootstrap.min.css" rel="stylesheet">
<script src="libs/bootstrap-4.6.0/bootstrap.bundle.min.js"></script><link href="libs/_Source%20Sans%20Pro-0.4.0/font.css" rel="stylesheet">
<link href="https://fonts.googleapis.com/css2?family=Fira%20Mono:wght@400;600&amp;display=swap" rel="stylesheet">
<script src="libs/bs3compat-0.3.1/transition.js"></script><script src="libs/bs3compat-0.3.1/tabs.js"></script><script src="libs/bs3compat-0.3.1/bs3compat.js"></script><link href="libs/bs4_book-1.0.0/bs4_book.css" rel="stylesheet">
<script src="libs/bs4_book-1.0.0/bs4_book.js"></script><link rel="apple-touch-icon" sizes="180x180" href="/apple-touch-icon.png">
<link rel="icon" type="image/png" sizes="32x32" href="/favicon-32x32.png">
<link rel="icon" type="image/png" sizes="16x16" href="/favicon-16x16.png">
<link rel="manifest" href="/site.webmanifest">
<link rel="mask-icon" href="/safari-pinned-tab.svg" color="#d42d2d">
<meta name="apple-mobile-web-app-title" content="Data Science at the Command Line">
<meta name="application-name" content="Data Science at the Command Line">
<meta name="msapplication-TileColor" content="#b91d47">
<meta name="theme-color" content="#ffffff">
<script>
      (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
      (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
      m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
      })(window,document,'script','https://www.google-analytics.com/analytics.js','ga');
      ga('create', 'UA-43246574-3', 'auto');
      ga('send', 'pageview');
    </script><script src="https://cdnjs.cloudflare.com/ajax/libs/autocomplete.js/0.38.0/autocomplete.jquery.min.js" integrity="sha512-GU9ayf+66Xx2TmpxqJpliWbT5PiGYxpaG8rfnBEk1LL8l1KGkRShhngwdXK1UgqhAzWpZHSiYPc09/NwDQIGyg==" crossorigin="anonymous"></script><script src="https://cdnjs.cloudflare.com/ajax/libs/mark.js/8.11.1/mark.min.js" integrity="sha512-5CYOlHXGh6QpOFA/TeTylKLWfB3ftPsde7AnmhuitiTX4K5SqCLBeKro6sPS8ilsz1Q4NRx3v8Ko2IBiszzdww==" crossorigin="anonymous"></script><!-- CSS --><link rel="stylesheet" href="dsatcl2e.css">
</head>
<body data-spy="scroll" data-target="#toc">

<div class="container-fluid">
<div class="row">
  <header class="col-sm-12 col-lg-2 sidebar sidebar-book"><a class="sr-only sr-only-focusable" href="#content">Skip to main content</a>

    <div class="d-flex align-items-start justify-content-between">
      <img id="cover" class="d-none d-lg-block" src="images/cover-small.png"><h1 class="d-lg-none">
        <a href="index.html" title="">Data Science at the Command Line, 2e</a>
      </h1>
      <button class="btn btn-outline-primary d-lg-none ml-2 mt-1" type="button" data-toggle="collapse" data-target="#main-nav" aria-expanded="true" aria-controls="main-nav"><i class="fas fa-bars"></i><span class="sr-only">Show table of contents</span></button>
    </div>

    <div id="main-nav" class="collapse-lg">
      <form role="search">
        <input id="search" class="form-control" type="search" placeholder="Search" aria-label="Search">
</form>
      <nav aria-label="Table of contents"><h2>Table of contents</h2>
        <ul class="book-toc list-unstyled">
<li><a class="active" href="index.html">Welcome</a></li>
<li><a class="" href="foreword.html">Foreword</a></li>
<li><a class="" href="preface.html">Preface</a></li>
<li><a class="" href="chapter-1-introduction.html"><span class="header-section-number">1</span> Introduction</a></li>
<li><a class="" href="chapter-2-getting-started.html"><span class="header-section-number">2</span> Getting Started</a></li>
<li><a class="" href="chapter-3-obtaining-data.html"><span class="header-section-number">3</span> Obtaining Data</a></li>
<li><a class="" href="chapter-4-creating-command-line-tools.html"><span class="header-section-number">4</span> Creating Command-line Tools</a></li>
<li><a class="" href="chapter-5-scrubbing-data.html"><span class="header-section-number">5</span> Scrubbing Data</a></li>
<li><a class="" href="chapter-6-project-management-with-make.html"><span class="header-section-number">6</span> Project Management with Make</a></li>
<li><a class="" href="chapter-7-exploring-data.html"><span class="header-section-number">7</span> Exploring Data</a></li>
<li><a class="" href="chapter-8-parallel-pipelines.html"><span class="header-section-number">8</span> Parallel Pipelines</a></li>
<li><a class="" href="chapter-9-modeling-data.html"><span class="header-section-number">9</span> Modeling Data</a></li>
<li><a class="" href="chapter-10-polyglot-data-science.html"><span class="header-section-number">10</span> Polyglot Data Science</a></li>
<li><a class="" href="chapter-11-conclusion.html"><span class="header-section-number">11</span> Conclusion</a></li>
<li><a class="" href="list-of-command-line-tools.html">List of Command-Line Tools</a></li>
</ul>

        <div class="book-extra">
          <p><a id="book-repo" href="https://github.com/jeroenjanssens/data-science-at-the-command-line">View book repository <i class=""></i></a></p>
        </div>

        <div>
          <a id="course-signup" href="/#course">Embrace the Command Line</a>
        </div>
      </nav>
</div>
  </header><main class="col-sm-12 col-md-9 col-lg-7" id="content"><!--bookdown:title:end--><!--bookdown:title:start--><div id="welcome" class="section level1 unnumbered">
<h1>Welcome<a class="anchor" aria-label="anchor" href="#welcome"><i class="fas fa-link"></i></a>
</h1>
<div class="h1" style="margin-top: 1.5rem;">
Data Science at the Command Line
</div>
<div class="h4">
Obtain, Scrub, Explore, and Model Data with Unix Power Tools
</div>
<div class="cover-in-text">
<div class="inline-figure"><img class="d-block d-lg-none" src="images/cover-small.png"></div>
</div>
<p>Welcome to the website of the second edition of <em>Data Science at the Command Line</em> by Jeroen Janssens, published by O’Reilly Media in October 2021. This website is free to use. The content is licensed under the <a href="https://creativecommons.org/licenses/by-nc-nd/4.0/">Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License</a>. You can order a physical copy at <a href="https://www.amazon.com/Data-Science-Command-Line-Explore-dp-1492087912/dp/1492087912">Amazon</a>.</p>
<p>Want to learn from Jeroen in person? Through his company, Data Science Workshops, Jeroen provides in-company training about data science at the command line and related topics such as Python, R, and machine learning. Visit <a href="https://datascienceworkshops.com">Data Science Workshops</a> for more information.</p>

<div class="rmdtip">
Jeroen is currently working on a new course, <a href="/#course">Embrace the Command Line</a>. If you haven’t fully embraced the command line yet, then this course might be for you. The beta cohort is expected to start in Q1 2021. You can learn more about this course and tell Jeroen what you think <a href="/#course">here</a>.
</div>
<div id="description" class="section level2 unnumbered">
<h2>Description<a class="anchor" aria-label="anchor" href="#description"><i class="fas fa-link"></i></a>
</h2>
<p>This thoroughly revised guide demonstrates how the flexibility of the command line can help you become a more efficient and productive data scientist. You’ll learn how to combine small yet powerful command-line tools to quickly obtain, scrub, explore, and model your data. To get you started, author Jeroen Janssens provides a Docker image packed with over 100 Unix power tools—useful whether you work with Windows, macOS, or Linux.</p>
<p>You’ll quickly discover why the command line is an agile, scalable, and extensible technology. Even if you’re comfortable processing data with Python or R, you’ll learn how to greatly improve your data science workflow by leveraging the command line’s power. This book is ideal for data scientists, analysts, engineers, system administrators, and researchers.</p>
<ul>
<li>Obtain data from websites, APIs, databases, and spreadsheets</li>
<li>Perform scrub operations on text, CSV, HTML, XML, and JSON files</li>
<li>Explore data, compute descriptive statistics, and create visualizations</li>
<li>Manage your data science workflow</li>
<li>Create your own tools from one-liners and existing Python or R code</li>
<li>Parallelize and distribute data-intensive pipelines</li>
<li>Model data with dimensionality reduction, regression, and classification algorithms</li>
<li>Leverage the command line from Python, Jupyter, R, RStudio, and Apache Spark</li>
</ul>
<div class="rmdnote">
If you find this book helpful, consider spreading the word! You could, for instance,
share it on <a href="https://twitter.com/intent/tweet?url=https%3A%2F%2Fdatascienceatthecommandline.com&amp;via=jeroenhjanssens&amp;text=Data%20Science%20at%20the%20Command%20Line%2C%20second%20edition">Twitter</a>,
write a review on <a href="https://www.amazon.com/Data-Science-Command-Line-Explore-dp-1492087912/dp/1492087912">Amazon</a>, or
star the <a href="https://github.com/jeroenjanssens/data-science-at-the-command-line">GitHub repository</a>. Much appreciated!
</div>
</div>
<div id="praise" class="section level2 unnumbered">
<h2>Praise<a class="anchor" aria-label="anchor" href="#praise"><i class="fas fa-link"></i></a>
</h2>
<blockquote>
<p>
Traditional computer and data science curricula all too often mistake the command line as an obsolete relic instead of teaching it as the modern and vital toolset that it is. Only well into my career did I come to grasp the elegance and power of the command line <span class="keep-together">for easily</span> exploring messy datasets and even creating reproducible data pipelines <span class="keep-together">for work. The</span> first edition of <em>Data Science at the Command Line</em> was one of the <span class="keep-together">most comprehensive and clear</span> references when I was a novice in the art, and now <span class="keep-together">with the second edition,</span> I’m again learning new tools and applications from it.
</p>
<p data-type="attribution">
<strong>Dan Nguyen</strong>, data scientist, former news application developer at ProPublica, and former Lorry I. Lokey Visiting Professor in <span class="keep-together">Professional Journalism at Stanford University</span>
</p>
</blockquote>
<blockquote>
<p>
The Unix philosophy of simple tools, each doing one job well, then cleverly piped <span class="keep-together">together, is</span> embodied by the command line. Jeroen expertly discusses how to <span class="keep-together">bring that philosophy</span> into your work in data science, illustrating how the <span class="keep-together">command line is not only the</span> world of file input/output, but also the <span class="keep-together">world of data manipulation, exploration, and even modeling.</span>
</p>
<p data-type="attribution">
<strong>Chris H. Wiggins</strong>, associate professor in the department of applied physics and applied mathematics at Columbia University, <span class="keep-together">and chief data scientist at <span class="plain">The New York Times</span></span>
</p>
</blockquote>
<blockquote>
<p>
This book explains how to integrate common data science tasks into a <span class="keep-together">coherent workflow. It’s</span> not just about tactics for breaking down problems, <span class="keep-together">it’s also about strategies for assembling the pieces of the solution.</span>
</p>
<p data-type="attribution">
<strong>John D. Cook</strong>, consultant in applied mathematics, <span class="keep-together">statistics, and technical computing</span>
</p>
</blockquote>
<blockquote class="pagebreak-before">
<p>
Despite what you may hear, most practical data science is still focused on interesting <span class="keep-together">visualizations and insights</span> derived from flat files. Jeroen’s book leans into this <span class="keep-together">reality, and helps</span> reduce complexity for data practitioners by showing how <span class="keep-together">time-tested command-line tools</span> can be repurposed for data science.
</p>
<p data-type="attribution">
<strong>Paige Bailey</strong>, principal product manager <span class="keep-together">code intelligence at Microsoft, GitHub</span>
</p>
</blockquote>
<blockquote>
<p>
It’s amazing how fast so much data work can be performed at the command line <span class="keep-together">before ever pulling</span> the data into R, Python, or a database. Older technologies like <span class="keep-together">sed and awk are still</span> incredibly powerful and versatile. Until I read <em>Data Science <span class="keep-together">at the Command Line</span></em>, I had only heard of these tools but never saw their full power. <span class="keep-together">Thanks to Jeroen,</span> it’s like I now have a secret weapon for working with large data.
</p>
<p data-type="attribution">
<strong>Jared Lander</strong>, chief data scientist at Lander Analytics, organizer of the New York Open Statistical Programming Meetup, <span class="keep-together">and author of <span class="plain">R for Everyone</span></span>
</p>
</blockquote>
<blockquote>
<p>
The command line is an essential tool in every data scientist’s toolbox, <span class="keep-together">and knowing it well</span> makes it easy to translate questions you have of your <span class="keep-together">data to real-time insights. Jeroen</span> not only explains the basic Unix philosophy <span class="keep-together">of how to chain together single-purpose</span> tools to arrive at simple solutions <span class="keep-together">for complex problems, but also</span> introduces new command-line tools <span class="keep-together">for data cleaning, analysis, visualization, and modeling</span>.
</p>
<p data-type="attribution">
<strong>Jake Hofman</strong>, senior principal researcher at <span class="keep-together">Microsoft Research,</span> and adjunct assistant professor in the <span class="keep-together">department of applied mathematics at Columbia University</span>
</p>
</blockquote>
</div>
<div id="dedication" class="section level2 unnumbered">
<h2>Dedication<a class="anchor" aria-label="anchor" href="#dedication"><i class="fas fa-link"></i></a>
</h2>
<div style="text-align: center;">
<p><em>Once again to my wife, Esther. Without her continued encouragement, support,<br>
and patience, this second edition would surely have ended up in</em> /dev/null<em>.</em></p>
</div>
</div>
<div id="about-the-author" class="section level2 unnumbered">
<h2>About the Author<a class="anchor" aria-label="anchor" href="#about-the-author"><i class="fas fa-link"></i></a>
</h2>
<p><strong>Jeroen Janssens</strong> is an independent data science consultant and instructor. He enjoys visualizing data, implementing machine learning models, and building solutions using Python, R, JavaScript, and Bash. Jeroen manages <a href="https://datascienceworkshops.com">Data Science Workshops</a>, a training and coaching firm that organizes open enrollment workshops, in-company courses, inspiration sessions, hackathons, and meetups. Previously, he was an
assistant professor at Jheronimus Academy of Data Science and a data scientist at Elsevier in Amsterdam and various startups in New York City. Jeroen holds a PhD in machine learning from Tilburg University and an MSc in artificial intelligence from Maastricht University. He lives with his wife and two kids in Rotterdam, the Netherlands.
You can find Jeroen on <a href="https://twitter.com/jeroenhjanssens">Twitter</a>, <a href="https://github.com/jeroenjanssens">GitHub</a>, and <a href="https://www.linkedin.com/in/jeroenjanssens/">LinkedIn</a>.</p>
</div>
<div id="colophon" class="section level2 unnumbered">
<h2>Colophon<a class="anchor" aria-label="anchor" href="#colophon"><i class="fas fa-link"></i></a>
</h2>
<p>The animal on the cover of <em>Data Science at the Command Line</em> is a wreathed hornbill (<em>Rhytidoceros undulatus</em>). Also known as the bar-pouched wreathed hornbill, the species is found in forests in mainland Southeast Asia and in northeastern India and Bhutan. Hornbills are named for the <em>casques</em> that form on the upper part of the birds’ bills. No single obvious purpose exists for these hollow, keratinized structures, but they may serve as a means of recognition between members of the species, as an amplifier for the birds’ calls, or, because males often exhibit larger casques than females of the species, for gender recognition. Wreathed hornbills can be distinguished from plain-pouched hornbills, to which they are closely related and otherwise similar in appearance, by a dark bar on the lower part of the wreathed hornbills’ throats.</p>
<p>Wreathed hornbills roost in flocks of up to four hundred but mate in monogamous, lifelong partnerships. With help from the males, females seal themselves up in tree cavities behind dung and mud to lay eggs and brood. Through a slit large enough for his beak alone, the male feeds his mate and their young for up to four months. A diet of animal prey becomes predominantly fruit when females and their young leave the nest. Hornbill couples have been known to return to the same nest for as many as nine years.</p>
<p>Many of the animals on O’Reilly covers are endangered; all of them are important to the world.</p>
<p>The color illustration is by Karen Montgomery, based on a black and white engraving from Braukhaus’s <em>Lexicon</em>. The cover fonts are Gilroy Semibold and Guardian Sans. The text and heading font is Source Sans Pro and the code font is Fira Mono.</p>

</div>
</div>
  <div class="chapter-nav">
<div class="empty"></div>
<div class="next"><a href="foreword.html">Foreword</a></div>
</div></main><div class="col-md-3 col-lg-3 d-none d-md-block sidebar sidebar-chapter">
    <nav id="toc" data-toggle="toc" aria-label="On this page"><h2>On this page</h2>
      <ul class="nav navbar-nav">
<li><a class="nav-link" href="#welcome">Welcome</a></li>
<li><a class="nav-link" href="#description">Description</a></li>
<li><a class="nav-link" href="#praise">Praise</a></li>
<li><a class="nav-link" href="#dedication">Dedication</a></li>
<li><a class="nav-link" href="#about-the-author">About the Author</a></li>
<li><a class="nav-link" href="#colophon">Colophon</a></li>
</ul>

    </nav>
</div>

</div>
</div> <!-- .container -->

<footer class="bg-primary text-light mt-5"><div class="container-fluid">
    <div class="row">
      <div class="d-none d-lg-block col-lg-2 sidebar"></div>
      <div class="col-sm-12 col-md-9 col-lg-7 mt-3" style="max-width: 45rem;">
        <p><strong>Data Science at the Command Line, 2e</strong> by <a href="https://twitter.com/jeroenhjanssens" class="text-light">Jeroen Janssens</a>. Updated on December 14, 2021. This book was built by the <a class="text-light" href="https://bookdown.org">bookdown</a> R package.</p>
      </div>
      <div class="col-md-3 col-lg-3 d-none d-md-block sidebar"></div>
    </div>
  </div>
</footer>
</body>
</html>
